Abstract

We investigate the influence of different kinds of structure on the 
learning behaviour of a perceptron performing a classification task de- 
fined by a teacher rule. The underlying pattern distribution is permit- 
ted to have spatial correlations. The prior distribution for the teacher 
coupling vectors itself is assumed to be nonuniform. Thus classification 
tasks of quite different difficulty are included. As learning algorithms 
we discuss Hebbian learning, Gibbs learning, and Bayesian learning 
with different priors, using methods from statistics and the replica for- 
malism. We find that the Hebb rule is quite sensitive to the structure 
of the actual learning problem, failing asymptotically in most cases. 
In contrast, the behaviour of the more sophisticated methods of Gibbs
and Bayes learning is influenced by the spatial correlations only in an
intermediate regime of $\alpha$, where $\alpha$ specifies the size of the
training set. For the Bayesian case we show how enhanced prior knowledge
improves the performance.

1 Introduction 

In the statistical physics of neural networks one of the most important 
paradigms is the learning of a rule from examples, [|l|, 0]. The simplest 

*Based on the Diploma thesis of G. Dirscherl, Regensburg 1996 
case is that (i) the rule can be represented by a "teacher perceptron",
while (ii) at the same time the neural network, which tries to learn 
the rule, is also given by a perceptron, called the "student". How- 
ever, although much is known on this generalization problem, at least 
for single-layer perceptrons, see e.g. |^ and references therein, two 
simplifying assumptions are usually made, namely that (a) the "rule"
itself, and (b) the examples, are both completely random, i.e. (a) without
correlations between the components $B_i$, $i = 1, \ldots, N$, of the teacher
perceptron's coupling vector $B$ connecting the input units $i$ to the
output unit, and (b) without correlations between the components $\xi_i^\mu$
with different $i$ and/or $\mu$, respectively, of the inputs $\xi^\mu$.

In practical cases such correlations of course exist, i.e. both
spatial correlations (e.g. in the 'rule' $B$, i.e. between different components
$B_i$ of the teacher perceptron, and/or in the components of the
vectors $\xi$ representing the inputs to be classified by the system) and also
semantic correlations (e.g. two different inputs $\xi^\mu$ and $\xi^\nu$ may represent
different 'handwritings' of the same word). Here we only mention that
storage problems with semantic correlations have been treated in p, ^ 
and concentrate in the following on spatial correlations, by assuming 
that all patterns $\xi^\mu$ are drawn independently from the same non-trivial
probability distribution, see below. In context with the simpler 'stor- 
age capacity problem', spatial correlations have already been treated 
in but the 'correlated generalization problem' itself, which is 

the focus of our paper, has not yet been studied, as far as the authors 
know, except in a paper of Tarkowski and Lewenstein, ^ , where only 
the special case of Gibbs learning with uncorrelated teacher couplings
was discussed. 

In all these papers on correlated patterns, only single-layer
perceptrons have been considered, whereas for uncorrelated patterns
the generalization problem has also been extensively treated for mul- 
tilayer perceptrons. Although a lot of interesting results, which may 
also be of practical relevance, have been obtained for these more re- 
alistic multilayer networks, see e.g. [0, |^, this was for uncorrelated 
systems and uncorrelated tasks only. Moreover, it has turned out in 
these and similar studies that multilayer networks cannot be treated 
successfully without a proper understanding of the behaviour of the 
single-layer sub-perceptrons, which are the building blocks of the multilayer
systems. Therefore we concentrate here on those "pre-requisite
single-layer perceptrons", treating the influence of spatial correlations
on the generalization ability of these simplest neural networks. As we 
will see, this influence can be useful or detrimental, depending on the 
task and on the system. If possible, we mention explicitly in the text, 
or at the end in the discussion, which of our results can be transferred 
to multilayer systems and can perhaps be used in some kind of 'strategy'.
Nevertheless, one should stress here that the single-layer
perceptron itself has recently become a quite popular and successful
classifier in so-called support vector machines, 0, and is more than
just a toy model. So much for the motivation of what follows.

In our paper we consider exclusively the case of so-called batch learning,
i.e. the 'student system' is always trained with all examples, which
are kept in mind without any preference, and is forced to classify not
only the last training example, but all members of the training set correctly,
whereas with so-called "online learning" (see e.g. [|T^) at
every training step a new pattern is presented to the student and the
student uses only this newly added example in the training. Extending
our work to multilayer perceptrons for 'batch learning' would in
fact be rather expensive, whereas it is much easier for the case of online
learning. These questions are under investigation.

In the following we therefore study, by analytical methods, the generalization
problem "with spatial structure" as specified below: a "student
perceptron" is considered, trying to learn by batch algorithms a
rule given by a "spatially structured teacher perceptron". The set of
training examples itself is spatially structured, too, and we study how
the student takes over the spatial correlations inherent in the training
examples and in the teacher perceptron, and how the generalization
ability depends on these parameters as a function of the size $\alpha$ of the
training set. The main problem is of course how the spatial structure
can be used most effectively, implicitly or explicitly, by the learning
process considered. As learning algorithms we study Hebbian learning,
Gibbs learning, and Bayesian learning, using statistical methods and
the replica formalism. Although the spatial structure of the patterns
and of the teacher machine does not matter asymptotically for $\alpha \to \infty$
in the two last-mentioned cases (see below), we find that the correlations,
as well as enhanced prior information in the Bayesian case, can
be quite useful at intermediate values of $\alpha$.

Concerning the spatial structure considered below, we concentrate
on the basic case of segmentation (or, more generally, quasi-segmentation,
see below) of the system into a finite, or infinite, number of segments,
which have a finite mutual correlation between the activity of the neurons
belonging to the same resp. different segments, and similarly partitioned
correlations (but with different strengths) of the synaptic couplings
joining these neurons. Real data has such correlations, and it is
usually part of preprocessing the data to detect such global dependencies,
e.g. by Principal Component Analysis (PCA, see e.g. chap. 8 in
n| , or [...]). Although the simplest case we consider, spatial correlations
corresponding to just two segments of equal size, is a restriction,
the basic properties can actually be investigated quite clearly. On the
other hand it is rather natural to assume similar correlations in the 
classifying 'teacher rule' as well as in the patterns; this reflects the fact 
that similarities in the properties of typical patterns correspond to a 
similar impact on the classification labels of the patterns. This is again 
a property encountered in practice. More details are given below. 



2 Basic Definitions 

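We consider as usual a system with binary input patterns $\xi = (\xi_1, \ldots, \xi_N)$,
where the $\xi_i$ are $\pm 1$. These input patterns generate at the teacher and
student perceptrons, respectively, the so-called post-synaptic fields

$$h_B := \frac{1}{\sqrt{N}}\, B \cdot \xi = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} B_i \xi_i \tag{1}$$

and

$$h_J := \frac{1}{\sqrt{N}}\, J \cdot \xi = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} J_i \xi_i . \tag{2}$$

The corresponding outputs are $\sigma_B := \mathrm{sign}\, h_B$, which is the "correct
output", given by the teacher, and $\sigma_J := \mathrm{sign}\, h_J$. The stability of the
student's output, if it is correct, is given by the positive quantity
$\kappa := \sigma_B\, J \cdot \xi / (|J| \sqrt{N})$.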

As usual, the generalization ability $g(\alpha)$ is defined as the probability
that the student, after training, produces the same output as
the teacher on a newly added random input which does not belong
to the training set. Here the "newly added random input" is specified
as follows: it should be different from the training inputs, but
drawn from the same probability distribution, i.e. with the same spatial
correlations (see below). The corresponding error probability is
$\epsilon := 1 - g(\alpha)$. If there are no correlations, $\epsilon$ is given as usual by the
overlap $r := (J \cdot B)/(|J| \cdot |B|)$ of the coupling vectors of the two perceptrons,
via $g(\alpha) = 1 - (1/\pi) \arccos(r)$, see e.g. 0. With correlations,
however, the following non-trivial pattern- and (teacher-) phase-space
correlation matrices come into play for $i,j = 1, \ldots, N$:

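$$C^P_{ij} := \langle \xi_i \xi_j \rangle_\xi , \quad \text{and} \quad C^B_{ij} := \langle B_i B_j \rangle_B . \tag{3}$$

(For $i = j$ these correlations are of course trivial, i.e. $C^P_{ii} = 1$, $C^B_{ii} =
B^2/N$ (also $= 1$ without restriction).) The brackets $\langle \ldots \rangle_\xi$ resp.
$\langle \ldots \rangle_B$ imply ensemble averages with the corresponding binomial resp.
Gaussian probability densities, e.g.

$$P(B) = \left[ (2\pi)^N \,\mathrm{Det}\, C^B \right]^{-1/2} \exp\left[ -\frac{1}{2} \sum_{i,j=1}^{N} B_i \left( (C^B)^{-1} \right)_{ij} B_j \right] . \tag{4}$$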



In the following we skip the sub-indexes $\xi$ and $B$ for simplicity, since
we additionally assume that the system is self-averaging; i.e. for almost
all configurations of the patterns $\xi$ and of the teacher perceptron
$B$ considered, the same correlation matrices $C^P$ and $C^B$, and also the
expressions defined below, can be obtained not only by the ensemble
averages $\langle \ldots \rangle_\xi$ resp. $\langle \ldots \rangle_B$, but also for a fixed realization by averaging
over equivalent pairs of sites $(i,j)$ in the limit of infinitely large systems,
$N \to \infty$, see below. Moreover, as already mentioned, we exclude
semantic correlations by requiring that for different patterns $\xi^\mu$ and $\xi^\nu$
one always has $\langle \xi^\mu_i \xi^\nu_j \rangle = 0$ for $i,j = 1, \ldots, N$. With these definitions
one gets additionally the important parameters

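$$T := \langle h_B^2 \rangle = N^{-1} \sum_{i,j=1}^{N} \langle \xi_i \xi_j \rangle \langle B_i B_j \rangle = N^{-1} \sum_{i,j=1}^{N} C^P_{ij}\, C^B_{ij} , \tag{5}$$

$$S := \langle h_J^2 \rangle = N^{-1} \sum_{i,j=1}^{N} \langle \xi_i \xi_j \rangle \langle J_i J_j \rangle = N^{-1} \sum_{i,j=1}^{N} C^P_{ij} \langle J_i J_j \rangle , \tag{6}$$

and

$$R := \langle h_J\, h_B \rangle = N^{-1} \sum_{i,j=1}^{N} C^P_{ij} \langle J_i B_j \rangle . \tag{7}$$

Here $T$ is fixed by the "teacher rule" and the spatial pattern correlations,
while $S$ and $R$ change in the course of the learning process.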

As already mentioned, our paper is motivated by the natural assumption
that the spatial pattern correlations, and the phase-space
correlations as well, i.e. spatial correlations in the couplings, correspond
structurally to a segmented system, in a similar way as words are
segmented into letters but recognized as a whole, |T^. Such a segmentation
arises implicitly or explicitly in many application tasks. It is
also natural to assume that pattern- and phase-space correlations are
segmented in the same way, which means that the correlation matrices
have the same eigenvectors $e^k = (e^k_1, \ldots, e^k_N)$, with $k = 1, \ldots, N$, although
the corresponding eigenvalues $C^P_k$ and $C^B_k$ may be drastically different
([0, ^). In fact, only this agreement of the eigenvectors is what we
postulate in the following when talking of "the general quasi-segmented
case". Moreover, we often specialize below to "the simplest segmented
case" by making the natural assumption of only two segments of the
same size:

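$$\xi := (\xi^0, \xi^1) = (\xi^0_1, \ldots, \xi^0_{N/2},\, \xi^1_1, \ldots, \xi^1_{N/2}) \tag{8}$$

with

$$\langle \xi^0_i \xi^1_j \rangle = \delta_{ij}\, c_p , \quad \langle \xi^0_i \xi^0_j \rangle = \langle \xi^1_i \xi^1_j \rangle = \delta_{ij} , \tag{9}$$

and analogously $B := (B^0, B^1) = (B^0_1, \ldots, B^0_{N/2},\, B^1_1, \ldots, B^1_{N/2})$ with

$$\langle B^0_i B^1_j \rangle = \delta_{ij}\, c_t , \quad \langle B^0_i B^0_j \rangle = \langle B^1_i B^1_j \rangle = \delta_{ij} \tag{10}$$

for $i,j = 1, \ldots, N/2$. The correlation parameters $c_p$ and $c_t$ have to
be smaller than 1 in magnitude, but otherwise they can be arbitrary
real numbers. During the training process, the student perceptron also
develops a similar segmentation with a correlation parameter $c_s$.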
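
The two-segment structure of Eqs. (9) and (10) can be made concrete with a small sampling routine. The following Python sketch is an illustration only and not part of the original analysis: the function names `sample_patterns` and `sample_teacher` are hypothetical, NumPy is assumed, and the copy-or-flip construction is just one simple way to obtain binary patterns whose paired sites have correlation $c_p$.

```python
import numpy as np

def sample_patterns(p, n, c_p, rng):
    """Draw p binary (+1/-1) patterns of even length n with <xi0_i xi1_i> = c_p."""
    half = n // 2
    xi0 = rng.choice([-1, 1], size=(p, half))
    # Copy each site of segment 0 with probability (1 + c_p)/2, flip it otherwise;
    # the product xi0_i * xi1_i then has mean c_p.
    keep = rng.random((p, half)) < (1.0 + c_p) / 2.0
    xi1 = np.where(keep, xi0, -xi0)
    return np.concatenate([xi0, xi1], axis=1)

def sample_teacher(n, c_t, rng):
    """Draw a Gaussian teacher vector with unit variance and <B0_i B1_i> = c_t."""
    half = n // 2
    b0 = rng.standard_normal(half)
    b1 = c_t * b0 + np.sqrt(1.0 - c_t**2) * rng.standard_normal(half)
    return np.concatenate([b0, b1])
```

For large samples, the empirical mean of the products of paired sites should approach $c_p$ and $c_t$, respectively.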

In the "general quasi-segmented case", the generalization ability
$g(\alpha)$ is obtained from the three parameters $T$, $S$ and $R$ defined in
Eqs. (5), (6) and (7) by $g = 2 \int_0^\infty dh_J \int_0^\infty dh_B\, P(h_J, h_B)$, with

$$P(h_J, h_B) = \left( 2\pi \sqrt{ST - R^2} \right)^{-1} \exp\left[ - \frac{S h_B^2 + T h_J^2 - 2R\, h_B h_J}{2 (ST - R^2)} \right] .$$

The result is

$$g = 1 - \frac{1}{\pi} \arccos\left( \frac{R}{\sqrt{ST}} \right) . \tag{11}$$

For the "simplest segmented case" defined through Eqs. (8), (9) and
(10), this general result is specialized, by evaluation of $S$, $T$ and $R$, to

$$g = 1 - \frac{1}{\pi} \arccos\left( \frac{r + c_p c_d}{\sqrt{(1 + c_p c_s)(1 + c_p c_t)}} \right) , \tag{12}$$

where

$$c_d := \frac{J^0 \cdot B^1 + J^1 \cdot B^0}{2\, |J^1| \cdot |B^0|} \tag{13}$$

is the cross-correlation between the different segments of the student's
and teacher's coupling vectors.
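
As a plausibility check of Eq. (11), one can compare the closed form with a direct Monte Carlo estimate of the probability that two correlated Gaussian fields have the same sign. The sketch below is a hedged illustration: the function names are hypothetical, and the segmented variant simply inserts $T = 1 + c_p c_t$, $S = 1 + c_p c_s$ and $R = r + c_p c_d$, which is how Eq. (12) is read here.

```python
import numpy as np

def generalization_from_fields(T, S, R):
    """Eq. (11): g = 1 - arccos(R / sqrt(S*T)) / pi."""
    return 1.0 - np.arccos(R / np.sqrt(S * T)) / np.pi

def generalization_segmented(r, c_d, c_s, c_p, c_t):
    """Eq. (12) for two equal segments, via T, S and R of that case."""
    T = 1.0 + c_p * c_t
    S = 1.0 + c_p * c_s
    R = r + c_p * c_d
    return generalization_from_fields(T, S, R)

def monte_carlo_agreement(T, S, R, samples, rng):
    """Fraction of Gaussian field pairs (h_J, h_B) that have the same sign."""
    cov = np.array([[S, R], [R, T]])  # covariance of (h_J, h_B)
    h = rng.multivariate_normal(np.zeros(2), cov, size=samples)
    return np.mean(np.sign(h[:, 0]) == np.sign(h[:, 1]))
```

With enough samples the Monte Carlo fraction should agree with the closed form up to statistical noise.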



3 Hebbian learning 

First, we briefly consider Hebbian learning, although this learning
prescription generally fails for $\alpha \to \infty$ in the presence of correlations,
which is not astonishing (see e.g. [15]) and contrasts strongly with Gibbs
and Bayes learning (see below). However, as we will see, even in the
presence of correlations the results for Hebbian learning are interesting
if the number $p := \alpha N$ of training examples is small compared to $N$,
i.e. for $\alpha \ll 1$.

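Hebbian learning is defined by the one-shot prescription

$$J_i = N^{-1/2} \sum_{\mu=1}^{p} \xi_i^\mu\, \mathrm{sign}\left( h_B^\mu \right) , \tag{14}$$

which leads for the "general quasi-segmented case" to

$$S = \alpha\, N^{-1} \sum_{k=1}^{N} \left[ (C^P_k)^2 + \frac{2\alpha}{\pi T}\, (C^P_k)^3\, C^B_k \right] \tag{15}$$

and

$$R = \alpha \sqrt{\frac{2}{\pi T}}\; N^{-1} \sum_{k=1}^{N} (C^P_k)^2\, C^B_k , \tag{16}$$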
whereas $T$ is fixed. Here $C^P_k$ and $C^B_k$ are the eigenvalues of the correlation
matrices of Eqn. (3). From these general results one can evaluate
the generalization ability simply via Eqn. (11). For the "simplest segmented
case" defined by Eqs. (9) and (10) one obtains $g(\alpha)$ from
Eqn. (12); the final result for the error probability $\epsilon = 1 - g$ is then

$$\epsilon(\alpha) = \frac{1}{\pi} \arccos\left[ \frac{\alpha\, (1 + 2 c_p c_t + c_p^2)}{\sqrt{\frac{\pi}{2}\, \alpha\, (1 + c_p c_t)^2 (1 + c_p^2) + \alpha^2 (1 + c_p c_t)(1 + 3 c_p c_t + 3 c_p^2 + c_p^3 c_t)}} \right] . \tag{17}$$

From this result for the "simplest segmented case" the following general
conclusions can be drawn:

• For small $\alpha$, Hebbian learning is quite effective: the generalization
error $\epsilon(\alpha)$ decreases rapidly with increasing $\alpha$ as
$\epsilon(\alpha) = 1/2 - O(\sqrt{\alpha})$.

• Moreover, one can see from Fig. 1 that the decrease of the generalization
error is faster if the correlations are "useful" (i.e. for
$c_t c_p > 0$); whether this is the case or not does of course not depend
on the student, but only on the given training examples.
That is, if the choice of the training examples is the teacher's task, he
(or she) should try to give examples which are in accordance with
the spatial correlations inherent in the 'rule', such that $c_p c_t > 0$.
On the other hand, what the student could do is to monitor the
spatial correlations in the examples to get an estimate of $c_p$
already for rather small $\alpha$. Then, by comparing the 'monitored'
values of $\epsilon(\alpha)$ and $c_p$ with Eqn. (17), he (or she) can estimate $c_t$
(i.e. an important part of the rule to be discovered), which may be
useful afterwards for Bayesian learning, see sections 5.2 and 5.3
below, where different priors are considered. Of course, for the
'general quasi-segmented case' this may be illusory.

• However, in the limit $\alpha \to \infty$, the error of the Hebbian learning
prescription does not converge to zero, but to

$$\epsilon_\infty := \lim_{\alpha \to \infty} \epsilon(\alpha) = \frac{1}{\pi} \arccos\left[ \frac{1 + 2 c_p c_t + c_p^2}{\sqrt{(1 + c_p c_t)(1 + 3 c_p c_t + 3 c_p^2 + c_p^3 c_t)}} \right] . \tag{18}$$
This residual generalization error for Hebbian learning is due to the
fact that the correct value for the student structure, $c_s = c_t$, is usually
not achieved for $\alpha \to \infty$, although $\epsilon(\alpha)$, as obtained with the Hebb
rule, decreases monotonically with increasing $\alpha$. We remark already here
that, in contrast, for the Gibbs and Bayes algorithms $\epsilon(\alpha)$
always vanishes for $\alpha \to \infty$, and there the asymptotics of the limiting
behaviour does not depend on the correlations at all (see below).
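
The Hebbian error formula for two equal segments and its large-$\alpha$ limit are easy to evaluate numerically. The following Python sketch implements Eqs. (17) and (18) as they are read here; the function names are hypothetical, and the clipping of the arccos argument is an added numerical safeguard rather than part of the formulas.

```python
import numpy as np

def _segment_sums(c_p, c_t):
    """Polynomials in c_p and c_t that appear in Eqs. (17) and (18)."""
    a = 1.0 + 2.0 * c_p * c_t + c_p**2
    t = 1.0 + c_p * c_t
    d = 1.0 + 3.0 * c_p * c_t + 3.0 * c_p**2 + c_p**3 * c_t
    return a, t, d

def hebb_error(alpha, c_p, c_t):
    """Eq. (17): Hebbian generalization error at training-set size alpha."""
    a, t, d = _segment_sums(c_p, c_t)
    denom = np.sqrt(0.5 * np.pi * alpha * t**2 * (1.0 + c_p**2)
                    + alpha**2 * t * d)
    return np.arccos(np.clip(alpha * a / denom, -1.0, 1.0)) / np.pi

def hebb_residual_error(c_p, c_t):
    """Eq. (18): residual error in the limit alpha -> infinity."""
    a, t, d = _segment_sums(c_p, c_t)
    return np.arccos(np.clip(a / np.sqrt(t * d), -1.0, 1.0)) / np.pi
```

For $c_p = 0$ the residual error vanishes for every $c_t$, while for $c_p \neq 0$ and generic $c_t$ it stays finite, in line with the discussion above.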

For the Hebbian case, the behaviour of $\epsilon_\infty$ as a function of $c_p$ for
different values of $c_t$ is plotted in Fig. 2. Obviously, with Hebbian learning,
correlations in the patterns usually lead to a nonvanishing residual
generalization error; moreover, as already mentioned, an opposite sign
in the correlations of patterns and teacher vector, respectively, makes
the learning task more difficult. (This observation will probably again
transfer to more complicated networks.) Nevertheless, for fixed $c_t$,
whatever the sign of $c_t c_p$ is, and although for sufficiently small values
of $|c_p|$ the error increases $\propto |c_p|$ with increasing $|c_p|$, there is, according
to Fig. 2, finally a decrease down to 0 in the residual error if $|c_p|$
increases beyond a certain value, which depends on $c_t$. This again is
an important statement, which means that sufficiently strong spatial
correlations in the patterns will almost always be useful.

There are thus three limits where with Hebbian learning and fixed $c_t$
a vanishing residual generalization error is achieved for $\alpha \to \infty$, namely

(i) for uncorrelated pattern spaces ($c_p = 0$); the value of $c_t$ does not
matter at all in this case, as can be seen already from Eq. (17), since
then $\epsilon(\alpha) = \pi^{-1} \arccos [1 + \pi/(2\alpha)]^{-1/2}$, which vanishes for $\alpha \to \infty$ as
$\epsilon \approx 1/\sqrt{2\pi\alpha}$;

(ii) for $c_p = \pm 1$, with $c_t \neq -c_p$; in this case the pattern segments
are identical up to a factor $\pm 1$; this corresponds to an effective reduction of $N$
to $N/2$, i.e. to a doubling of $\alpha$, but otherwise the same result as for (i);

(iii) for $c_t = \pm 1$, with $c_p \neq -c_t$; in this case one has
$\epsilon(\alpha) = \pi^{-1} \arccos\{ 1/\sqrt{1 + \pi (1 + c_p^2)/[2\alpha (1 \pm c_p)^2]} \}$, which behaves for
$\alpha \to \infty$ as $\sqrt{1 + c_p^2}/[\sqrt{2\pi\alpha}\, (1 \pm c_p)]$.

In contrast to (ii) and (iii), if $c_t$ is not kept fixed, but if the point
$(c_p, c_t) = (-1, 1)$ or $(1, -1)$ is approached with fixed slope $dc_t/dc_p = -x$,
then, according to Eqn. (18), the residual error $\epsilon_\infty$ is a decreasing
function of $x$ for $0 < x < \infty$, going from $\epsilon_\infty = 1/2$ (which corresponds to zero
generalization ability) for $x = 0^+$, via $\epsilon_\infty = 1/4$ for $x = 1$, to $\epsilon_\infty = 0$
for $x \to \infty$. At $x = 0$, where $\epsilon_\infty$ vanishes, there is thus a discontinuity.

Except for (i), these are rather artificial cases, so the Hebb rule
fails if correlated patterns are to be learned.
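
The Hebbian prediction can also be compared with a direct simulation at finite $N$. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the patterns are generated by a simple copy-or-flip construction for the paired sites, and the one-shot rule follows Eq. (14) as read here. Finite-size deviations from the analytical result are to be expected.

```python
import numpy as np

def draw_patterns(p, n, c_p, rng):
    """p binary patterns of even length n whose paired sites have correlation c_p."""
    half = n // 2
    xi0 = rng.choice([-1, 1], size=(p, half))
    keep = rng.random((p, half)) < (1.0 + c_p) / 2.0
    return np.concatenate([xi0, np.where(keep, xi0, -xi0)], axis=1)

def simulate_hebb_error(n, alpha, c_p, c_t, n_test, seed):
    """Empirical generalization error of the one-shot Hebb rule."""
    rng = np.random.default_rng(seed)
    half = n // 2
    # Gaussian teacher with correlation c_t between paired sites.
    b0 = rng.standard_normal(half)
    b1 = c_t * b0 + np.sqrt(1.0 - c_t**2) * rng.standard_normal(half)
    teacher = np.concatenate([b0, b1])
    # Training set of p = alpha * N examples labelled by the teacher.
    train = draw_patterns(int(alpha * n), n, c_p, rng)
    labels = np.sign(train @ teacher)
    # Hebb rule: J_i = N^(-1/2) * sum_mu xi_i^mu * sign(h_B^mu).
    student = labels @ train / np.sqrt(n)
    # Fraction of fresh patterns on which student and teacher disagree.
    test = draw_patterns(n_test, n, c_p, rng)
    return np.mean(np.sign(test @ teacher) != np.sign(test @ student))
```

For example, `simulate_hebb_error(1000, 2.0, 0.5, 0.5, 4000, 0)` can be compared with the value of roughly 0.20 that Eq. (17) gives for $\alpha = 2$ and $c_p = c_t = 0.5$.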

For Ct = 0, we have found that even the modified Hebb prescription 
of [Q, which corresponds to the matrix transformation J — K - J with 
K = (C"^ + where the pattern correlation matrix C'^ is given 

by Eqn. (^), while I is the N x N unit matrix and u an optimization 
parameter, would yield at most a ~30%-reduction of the generalization 
error e(a), although for z/ = Jl — also vanishes. 



4 Gibbs learning 

In the case of Gibbs learning, the student perceptron is drawn at random
from the so-called version space $V$, which consists exactly of all perceptrons
which classify the training examples correctly. Tarkowski and
Lewenstein, , have treated storage and generalization of spatially and
semantically correlated patterns in perceptrons, but only for the special
case of Gibbs learning with uncorrelated teacher couplings ($C^B = I$
in Eqn. (3)). We extend their approach to $C^B \neq I$ and correct some of
their results (see below), using E. Gardner's replica method, [16], [17].
With the teacher field $u_t := N^{-1/2} B \cdot \xi$ ($= h_B$ in Eqn. (1)) and the different
student fields $u_a := N^{-1/2} J^a \cdot \xi$, where $a = 1, 2, \ldots, n$ enumerates
the replicas, one obtains for general quasi-segmentation
the following order parameters:

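$$T := \langle u_t^2 \rangle = N^{-1} \sum_{k=1}^{N} C_k B_k^2 \tag{19}$$

$$R_a := \langle u_t u_a \rangle = N^{-1} \sum_{k=1}^{N} C_k B_k J^a_k \tag{20}$$

$$S_a := \langle u_a^2 \rangle = N^{-1} \sum_{k=1}^{N} C_k (J^a_k)^2 \tag{21}$$

$$Q_{ab} := \langle u_a u_b \rangle = N^{-1} \sum_{k=1}^{N} C_k J^a_k J^b_k . \tag{22}$$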

Here the $C_k$ are again the eigenvalues of the pattern correlation matrix
$C^P$, while $B_k$ and $J^a_k$ are the components of $B$ resp. $J^a$ in the corresponding
basis; the fields $u_t$ and $u_a$ can be generated from normally
distributed, independent variables $w$, $v_t$ and $v_a$ by

$$u_t = \sqrt{T - R^2/Q}\; v_t + \frac{R}{\sqrt{Q}}\, w , \tag{23}$$

$$u_a = \sqrt{S - Q}\; v_a + \sqrt{Q}\, w . \tag{24}$$

The general result for the free energy, evaluated with the replica trick
assuming replica symmetry, which is exact in the present case, is
$F = \mathrm{Extr}\, [F_1 + \alpha F_2]$, where the energy term $F_2$ is

$$F_2 = 2 \int Dw\, H(x_1) \ln H(x_2) , \tag{25}$$

with $x_1 := R\, w \cdot (TQ - R^2)^{-1/2}$ and $x_2 := w \cdot (Q/(S - Q))^{1/2}$, where
$Dw := (2\pi)^{-1/2}\, dw\, \exp(-w^2/2)$ and $H(x) := \int_x^\infty Dw$. The entropy
term $F_1$ is given by


N 

Fi = \n{27i) - {he + {F + H)Cp 

k=i 



Fci + GHc^fcn 

^ E+{F + H)C[ j 

.^.GR,^^. (26) 

Here $E$, $F$, $G$ and $H$ are additional order parameters conjugate to $|J|$,
$Q$, $R$ and $S$, so that in all (since $|J|$ is fixed) $F$ has to be optimized for
seven order parameters.

For our "simplest segmented systems", see Eqs. (8), (9), and (10),
the general results from Eqs. (20), (21) and (22), see also (11), (12),
(13), specialize to

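$$R_a = r^a + c_p c_d^a , \quad S_a = 1 + c_p c_s^a , \quad Q_{ab} = q^{ab} + c_p q_d^{ab} , \tag{27}$$

with

$$r^a = N^{-1} B \cdot J^a , \quad q^{ab} = N^{-1} J^a \cdot J^b , \quad c_s^a = 2 N^{-1} J^{a0} \cdot J^{a1} , \quad q_d^{ab} = N^{-1} \left( J^{a0} \cdot J^{b1} + J^{a1} \cdot J^{b0} \right) . \tag{28}$$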
Concerning the free energy, with the saddle-point approach and again
with the replica symmetry assumption, the entropy term specializes to

Fi = ^{ln(27r) + ln[(g - 1 - + c,)(g - 1 + - c,)]} 

l-q + QdCs - c\ 

(g - 1 - + c^)(g - 1 + Q + c^) 
^ (r^ + eg) (g - 1 + Ctqd - QC^) + 2r Cdjct -qct + Cs- qd) ^29) 
{q-l- qd + Cs){q-l + qd + Cs){l- cl) 

which depends only on the five parameters $r$, $c_d$, $c_s$, $q$, and $q_d$, but not
on $c_p$, whereas the energy contribution specializes to

$$F_2 = 2 \int Dw\, H(x_1 w) \ln H(x_2 w) , \tag{30}$$

with

$$x_1 = \frac{r + c_p c_d}{\sqrt{(1 + c_p c_t)(q + c_p q_d) - (r + c_p c_d)^2}} , \tag{31}$$

$$x_2 = \sqrt{\frac{q + c_p q_d}{1 + c_s c_p - (q + c_p q_d)}} . \tag{32}$$

Using the conditions $\partial F/\partial r = \partial F/\partial c_s = \partial F/\partial c_d = \partial F/\partial q =
\partial F/\partial q_d = 0$ one obtains the evolution of all interesting quantities.

A major difference to the Hebb case can be seen from the asymptotic
behaviour for $\alpha \to \infty$: For an unstructured teacher perceptron ($c_t = 0$),
the entropy term can be simplified, since then $q = r$, $q_d = c_d$ and
$c_s = 0$. So one gets asymptotically $c_d \to c_p \cdot (1 - r)$ and

$$1 - r \simeq \frac{1}{4 \alpha^2 C^2 (1 - c_p^2)} , \tag{33}$$

with $C = (2\pi)^{-1/2} \int dx\, H(x) \ln H(x) \approx -0.360324$.

Thus with Gibbs learning in the case $c_t = 0$ a perfect overlap,
and thus perfect generalization, is reached for all values of $c_p$, in contrast
to Hebbian learning, where this was only the case when $c_p = 0$
(except for some limiting cases, see above). But the prefactor of the $1/\alpha^2$
behaviour of Eqn. (33) is proportional to $(1 - c_p^2)^{-1}$, which means that
asymptotically for the overlap $r$, but not for the generalization ability
itself (see below), spatial pattern correlations are still slightly detrimental
for the Gibbs case with $c_t = 0$, but only through the just-mentioned
prefactor, whereas the "residual error" itself now vanishes, in contrast
to the Hebb case.

Let us concentrate on the generalization error now: In Fig. 3 this
quantity is plotted for several values of $|c_p|$ ($c_t = 0$ fixed), showing that
the error becomes smaller with increasing $|c_p|$ for all $\alpha$. In other words:
the more structured the pattern space, the easier it is to actually learn
the classification task given by the teacher rule. This is in contrast to
the behaviour of $r$ (see above) but intuitively reasonable, and can be
understood a bit more thoroughly by the following consideration:

If we perform a coordinate transformation in the pattern and phase
space to diagonalize the correlation matrix (of the patterns), we have
two eigenvalues $1 \pm c_p$ determining the variance of the corresponding
sites. This means that the sites with $1 - |c_p|$ are less significant than
those with $1 + |c_p|$. Thus, the student can concentrate on the $N/2$
latter ones to learn the task. Since these are only half as many as
the whole set, learning can be performed faster. In the extreme case
of $|c_p| = 1$ the dimension of the system is effectively reduced to $N/2$,
leading to a rescaling of $\alpha$ by the factor 2. It is clear that this
reasoning can be transferred to more general segmentations and more
complex architectures.
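
The statement about the two eigenvalues can be checked directly for a small system. The following Python sketch builds the pattern correlation matrix of the two-segment case (unit diagonal, correlation $c_p$ between paired sites of the two segments) and diagonalizes it; the function name and the chosen size are purely illustrative.

```python
import numpy as np

def segmented_correlation_matrix(n, c):
    """Correlation matrix for two equal segments with pair correlation c."""
    half = n // 2
    eye = np.eye(half)
    return np.block([[eye, c * eye], [c * eye, eye]])

c_p = 0.6
# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eigenvalues = np.linalg.eigvalsh(segmented_correlation_matrix(8, c_p))
```

Half of the eigenvalues equal $1 - c_p$ and the other half $1 + c_p$, belonging to the antisymmetric and symmetric combinations of the paired sites.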

The above considerations provide an alternative view on the learn- 
ing problem investigated here as well, i.e. pattern sets which can be 
decomposed into components of different magnitude. Data preprocess- 
ing using principal component analysis techniques makes use of such 
structures in practical applications [0, Thus, correlations should 
be helpful in general. 

Nevertheless, looking at the asymptotic behaviour for $\alpha \to \infty$ of
the generalization error, which for Gibbs learning with $c_t = 0$ is

$$\lim_{\alpha \to \infty} \epsilon(\alpha) = \frac{1}{\pi} \arccos(r + c_p c_d) = \frac{1}{\pi} \arccos\left( 1 - \frac{1}{4 \alpha^2 C^2} \right) \approx \frac{0.625}{\alpha} , \tag{34}$$

we have a result which does not depend on the pattern correlations at all.
So, for large $\alpha$, structure in the patterns has no advantage in terms of
the generalization error. Actually, the fact that the improved generalization
ability due to structure in the pattern space is confined to an
intermediate $\alpha$-regime can easily be understood: to reach perfect generalization,
the sites with eigenvalue $1 - |c_p|$, which are less significant
at first, become important for $\alpha \to \infty$ to achieve the ultimate "fine
adjustment".
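
The numerical constants quoted for the Gibbs asymptotics can be reproduced with a short quadrature. The sketch below is a hedged illustration: it assumes that the asymptotic law reads $\epsilon \approx \pi^{-1} \arccos(1 - 1/(4\alpha^2 C^2))$, which for large $\alpha$ gives $\epsilon \approx 1/(\sqrt{2}\, \pi |C| \alpha)$; this reading is an inference from the quoted coefficient 0.625, and the function names and integration range are arbitrary choices.

```python
import math
import numpy as np

def gauss_tail(x):
    """H(x): probability that a standard normal variable exceeds x."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def gibbs_constant(x_max=12.0, steps=24000):
    """C = (2*pi)^(-1/2) * integral of H(x) ln H(x) dx, by the trapezoidal rule."""
    xs = np.linspace(-x_max, x_max, steps + 1)
    vals = np.array([gauss_tail(x) * math.log(gauss_tail(x)) for x in xs])
    h = xs[1] - xs[0]
    integral = h * (vals.sum() - 0.5 * (vals[0] + vals[-1]))
    return integral / math.sqrt(2.0 * math.pi)

C = gibbs_constant()
# Assumed large-alpha form: epsilon ~ coefficient / alpha.
coefficient = 1.0 / (math.sqrt(2.0) * math.pi * abs(C))
```

Under this assumption, `C` should come out close to the quoted value of about $-0.3603$ and `coefficient` close to 0.625.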

In the case of a correlated teacher vector ($c_t \neq 0$) things change
a bit. Fig. 4 shows the dependence of the generalization error on $c_p$
for several values of $c_t$ and fixed $\alpha = 2$ (which is something like an
intermediate value). We see that structure in the patterns can actually
worsen the generalization ability if the structure is in the opposite
direction to the teacher correlation, i.e. for $c_p c_t < 0$. This resembles
the behaviour of the Hebb rule, where this type of learning problem
is difficult as well, and again the result can probably be transferred
to more general situations.

Looking at the simultaneously diagonalized correlation matrices, the
reason for this becomes clear: sites with the smaller variance $1 - |c_p|$,
concerning the patterns, are related to teacher sites with the larger
eigenvalue $1 + |c_t|$, and therefore their loss in significance (due to a
small value $1 - |c_p|$) is somehow compensated by the larger weights of
the teacher vector.

Although not shown analytically, we expect from numerical evidence
that perfect generalization in the limit $\alpha \to \infty$ is achieved for $c_t \neq 0$ as
well, again with the law given in (34). This means that correlations in
the system asymptotically neither improve nor worsen the generalization
behaviour if one uses good enough learning rules.

Let us now break down the behaviour into the contributions from
the several order parameters. Fig. 5a,b shows the evolution of $r(\alpha)$ for
different values of $c_p$ with $c_t = 0$ and $c_t = 0.9$, respectively. For $c_t = 0$
a higher correlation $|c_p|$ leads to a smaller overlap. For $c_t = 0.9$ the
behaviour depends on the sign of $c_p$ as well. For small $\alpha$ the overlap
$r(\alpha)$ is larger for $c_p c_t > 0$ than for $c_p c_t < 0$; but for larger values of $\alpha$
the relation is opposite. To understand this "crossing behaviour" we
have to notice that the magnitude of the local fields, and so of the
stability of the patterns, is enhanced (reduced) for $c_p c_t > 0$ ($c_p c_t < 0$):

• For $\alpha \ll 1$, a small stability (small on average) merely leads to a
small bias of the version space away from the true teacher vector
(since the training patterns lie near the classification boundary).
The direction of this small bias is naturally such that the $c_p c_t > 0$
case yields higher overlap.

• For $\alpha \gg 1$ the biasing effect of the small stability disappears, since
the patterns cover the space rather densely. On the other hand,
for $c_p c_t > 0$ the phase space of the solutions is now more confined
($q$ is smaller) because of the constraint of a higher stability (see
the evolution of $q(\alpha)$ in Fig. 6) of the possible solutions. This
leads to a smaller overlap $r$ for $c_p c_t < 0$ in case of $\alpha \gg 1$.

Fig. 7 shows the evolution of the student structure $c_s(\alpha)$ for a
teacher correlation $c_t = 0.9$. It is interesting to see that opposite correlations
in the patterns (compared to the teacher) force the student to
adopt the teacher structure rather rapidly, with a similar explanation
as given above for the evolution of $r(\alpha)$.

The evolution of $c_d(\alpha)$ with $c_p$ (Fig. 8 for the case $c_t = 0$) is non-monotonic,
which generally occurs if $|c_p| > |c_t|$. Asymptotically, of
course, the value $c_d = c_t$ is approached. So a high correlation in the
patterns (e.g. $c_p > 0.7$) induces strong correlations $c_d(\alpha)$ in an intermediate
region around $\alpha \approx 1$, which improve (worsen) the generalization
ability in this regime for $c_p c_t > 0$ ($c_p c_t < 0$).

Finally we should mention that the independence of $\epsilon(\alpha \to \infty)$
of the pattern correlations $c_p$, which we have shown analytically in
Eq. (34) for $c_t = 0$, corrects a different result of Tarkowski and Lewenstein, 0.
For $c_t \neq 0$ and $c_p \neq 0$, because of the large number of order
parameters, we have not yet succeeded in calculating the limiting behaviour
analytically, although it is probably unchanged. Again, in view of the
results of 0, H, the result should also apply to the more complicated
multilayer architectures treated in these papers, and should also be
valid in the presence of certain classes of noise.

In the following section we treat Bayesian learning with different 
priors, while the results for Adatron learning, which leads to maxi- 
mal stability but not to optimal generalization, will be discussed in a 
separate paper. 

5 Bayesian learning 

Bayesian methods are successfully used for learning in neural networks,
see [18], 0] and |T^. In this approach a pattern is classified with
the purpose of minimizing the probability of a 'wrong answer'. The
framework requires the specification of a prior belief about the possible
networks and a noise model defining their answer behaviour.

More precisely, the noise model $p(s|J, \xi)$ defines the conditional
probability of getting the answer $s$ (correct or not) on a given pattern
$\xi$ for a general classifying automaton $J$ ranging over some sample
space. The probability $p(D|J)$ of the data $D$ comprising the whole
training set is typically given by simply multiplying all probabilities for
the single members of the training set, i.e. pairs of training questions
with 'correct answers', thus assuming that these pairs are given independently
of each other, i.e. without semantic correlations, whereas
spatial correlations may be included.

The so-called prior $p(J)$ defines the probability that the vector $J$
describes the automaton before the evidence of any data is taken into
account, i.e. on the basis of some prior knowledge. Using Bayes'
theorem we get

$$P(J|D) = \frac{p(J)\, p(D|J)}{P(D)} \tag{35}$$

as the a posteriori probability of $J$ after absorbing the evidence of the
training data. Here the so-called evidence of the model, $P(D) :=
\sum_J p(J)\, p(D|J)$, serves for normalization. The "most probable correct
answer" $s'$ on a test question $\xi'$ is then given by the weighted majority
vote due to $P(J|D)$ from (35). Here again we assume that the
same spatial correlations $C^P$, see Eqn. (3), apply both to the training
questions and to the test questions, while in both cases the "correct
answers" are given by the same "teacher automaton" $B$, which is not
specified explicitly in Eqn. (35) and in principle can have an architecture
different from that of the "student automaton" $J$ (although in
our case we assume the same architecture). Of course, we also assume
that the "student" uses the same 'noise model' both for training and
afterwards.
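
For a finite set of candidate coupling vectors, the posterior of Eq. (35) and the weighted majority vote can be written down in a few lines. The following Python sketch is a toy illustration only: the finite candidate set, the function names, and the tie-breaking rule are assumptions made here, and the noise model is taken to be deterministic, i.e. $p(D|J)$ is 1 if $J$ reproduces all training labels and 0 otherwise.

```python
import numpy as np

def posterior(candidates, prior, patterns, labels):
    """Posterior over a finite candidate set for a deterministic noise model."""
    # Column m is True if candidate m classifies every training pattern correctly.
    outputs = np.sign(patterns @ candidates.T)
    consistent = np.all(outputs == labels[:, None], axis=0)
    weights = prior * consistent      # p(J) p(D|J) with p(D|J) in {0, 1}
    evidence = weights.sum()          # P(D), the normalization
    if evidence == 0.0:
        raise ValueError("no candidate is consistent with the data")
    return weights / evidence

def bayes_answer(candidates, post, pattern):
    """Weighted majority vote of all candidates on one test pattern."""
    votes = np.sign(candidates @ pattern)
    # Ties are resolved in favour of +1 (an arbitrary choice).
    return 1.0 if post @ votes >= 0.0 else -1.0
```

With a uniform prior, the posterior is uniform over the consistent candidates, which corresponds to sampling from the version space as in Gibbs learning.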

In practice, a good choice of the noise model and the prior (which 
include the choice of the architecture used) is a crucial point for getting 
good generalization behaviour. One possibility for proper model selec- 
tion is to calculate the 'evidence' of several possible models, |T^, |TP|. 



Methods from statistical mechanics can be used to investigate systems
in the thermodynamic limit, see and concerning model selection
|]2T]]. The purpose of this section is to compare the behaviour
of Bayesian learning to Gibbs learning in the case of structured spaces 
on the one hand, and to investigate the influence of different priors on 
the other. As priors we use 

(1) a uniform prior over all normalized student coupling vectors;

(2) a restricted prior permitting only those student weight vectors
which have the correct (and in this case assumed as known) correlation
$c_s = c_t$. (If a sufficient number of training examples is given, the
'student' can get knowledge of $c_t$ by monitoring the spatial
statistics $c_p$ of the questions posed by the teacher and applying
Hebbian learning for some time, i.e. for finite $\alpha$, see above.)
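Monitoring the spatial statistics of the questions amounts to a simple estimate. The following is a minimal sketch, assuming Gaussian pattern components of unit variance whose only correlations are those between corresponding sites of the two segments (strength $c_p$); the generator and the function names are hypothetical and serve only to produce test data. It covers only the estimate of $c_p$, not the subsequent Hebbian step.

```python
import math
import random

def segmented_pattern(n, c_p, rng):
    """Pattern with two segments of n/2 sites; site i of the second segment
    has correlation c_p with site i of the first one (assumed model)."""
    half = n // 2
    first = [rng.gauss(0.0, 1.0) for _ in range(half)]
    second = [c_p * x + math.sqrt(1.0 - c_p ** 2) * rng.gauss(0.0, 1.0)
              for x in first]
    return first + second

def estimate_cp(patterns):
    """Average product of corresponding sites over all patterns."""
    half = len(patterns[0]) // 2
    total = sum(xi[i] * xi[i + half] for xi in patterns for i in range(half))
    return total / (half * len(patterns))

rng = random.Random(1)
patterns = [segmented_pattern(100, 0.7, rng) for _ in range(400)]
cp_hat = estimate_cp(patterns)
print(cp_hat)
```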

Since we are considering here a deterministic classification, the
appropriate "noise model" gives probability 1 for the correct answer
(as determined by the coupling vector $J$ and the perceptron mapping
rule) and 0 otherwise.
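Written out, this noise model turns the likelihood into an indicator of the version space. The lines below are a sketch in assumed notation: $\Theta$ is the Heaviside step function, $s^\mu$ and $\xi^\mu$ ($\mu = 1, \dots, p$) denote the training answers and questions, and $\chi_V$ is the indicator function of the version space $V$; none of these symbols is fixed by the text above.

```latex
p(s|J,\xi) = \Theta(s\, J\cdot\xi), \qquad
p(D|J) = \prod_{\mu=1}^{p} \Theta(s^{\mu}\, J\cdot\xi^{\mu}) =: \chi_V(J), \\
\mathcal{P}(J|D) = \frac{p(J)\,\chi_V(J)}{\sum_{J'} p(J')\,\chi_V(J')} .
```

The posterior is thus the prior restricted to the version space and renormalized, which is why the two priors differ only in which members of $V$ they admit.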

One should stress that these choices contain a rather large amount 
of prior knowledge about the possible teacher rules which is not in the 
same way available in practical problems. 

5.1 Relation to the Gibbs case 

In our case it is pretty easy to derive the Bayes properties from the
already calculated quantities for the Gibbs case. This is possible since
one can construct a perceptron from the Gibbsian version space $V$ which
performs like the Bayesian classification, namely the
Central-Point-Perceptron (CP-perceptron). If the $M$ members $J_i$ of
the version space carry identical a priori probabilities, the
CP-perceptron is simply

$$ J^{CP} = \lim_{M\to\infty} \frac{1}{K} \sum_{i=1}^{M} J_i . \qquad (36) $$

Here $K$ is chosen such that $|J^{CP}| = N^{1/2}$. Therefore

$$ K^2 = \lim_{M\to\infty} \frac{1}{N} \sum_{l,m=1}^{M} J_l\cdot J_m = \lim_{M\to\infty} \left[ M + M(M-1)\,q \right] . \qquad (37) $$

So one gets for the overlap

$$ r^{CP} = \frac{1}{N}\, J^{CP}\cdot B = \lim_{M\to\infty} \frac{1}{NK} \sum_{i=1}^{M} J_i\cdot B . \qquad (38) $$



Since $r := \lim_{M\to\infty} (NM)^{-1} \sum_i J_i\cdot B$ is the overlap
for the case of Gibbs learning, we have in this way the simple relations

$$ r^{CP} = \frac{r}{\sqrt{q}} , \qquad c_d^{CP} = \frac{c_d}{\sqrt{q}} . \qquad (39) $$
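The step from Eqs. (36)-(38) to the relation for $r^{CP}$ can be made explicit. A short sketch, assuming (as in the Gibbs calculation) that two different members of the version space have the typical overlap $q$ and each member has the overlap $r$ with the teacher:

```latex
K^2 = M + M(M-1)\,q \;\simeq\; M^2 q \qquad (M \to \infty), \\
r^{CP} = \frac{1}{NK} \sum_{i=1}^{M} J_i\cdot B
       = \frac{M}{K}\, r \;\longrightarrow\; \frac{r}{\sqrt{q}} .
```

The same normalization factor $M/K \to 1/\sqrt{q}$ multiplies every quantity that is linear in the student vector.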



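Additionally, one needs the correlation between the two different
segments of the CP student perceptron:

$$ c_s^{CP} = \frac{2}{N}\, (J^{CP})^{(1)}\cdot (J^{CP})^{(2)} = \lim_{M\to\infty} \frac{2}{N K^2} \sum_{l,m=1}^{M} J_l^{(1)}\cdot J_m^{(2)} = \lim_{M\to\infty} \frac{M c_s + M(M-1)\, q_d}{K^2} . \qquad (40) $$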



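The fact that the CP-perceptron reaches the same generalization ability
as the Bayes classification follows from

$$ \sigma^{CP} = \mathrm{sign}\left( \langle h_J \rangle \right) \qquad (41) $$

and

$$ \sigma^{bayes} = \mathrm{sign}\left( \langle \mathrm{sign}(h_J) \rangle \right) . \qquad (42) $$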



In [|l], ^ it is proved for the case of vanishing pattern and teacher
correlations ($c_p = c_t = 0$) that the generalization abilities
obtained with the CP perceptron, Eqn. (41), and the corresponding Bayes
algorithm, Eqn. (42), respectively, agree for almost all $\xi$ in the
limit $M \to \infty$, where additionally $M \ll N$ is assumed. Probably
the agreement of the generalization abilities also holds if pattern and
teacher correlations are included.
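The statement can be probed numerically on a very small system. The sketch below is an illustration under assumptions that are not part of the analysis above: uncorrelated Gaussian patterns ($c_p = 0$), an unstructured teacher ($c_t = 0$), the uniform prior realized by isotropic random vectors, and a version space sampled by plain rejection, which is feasible only for very small $N$ and $p$. It compares the centre-point classifier with the majority vote over the sampled version space; the sizes `N`, `P`, `M` are arbitrary choices.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sign(x):
    return 1 if x >= 0 else -1

def random_vector(n, rng):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def sample_version_space(data, n, m, rng):
    """Rejection sampling: keep isotropic vectors that reproduce all answers,
    normalized to length sqrt(n)."""
    members = []
    while len(members) < m:
        J = random_vector(n, rng)
        if all(sign(dot(J, xi)) == s for xi, s in data):
            norm = math.sqrt(dot(J, J))
            members.append([math.sqrt(n) * j / norm for j in J])
    return members

def centre_point(members, n):
    """Sum of the members, rescaled by K so that |J_cp|^2 = n."""
    total = [sum(column) for column in zip(*members)]
    K = math.sqrt(dot(total, total) / n)
    return [t / K for t in total]

def bayes_vote(members, xi):
    """Majority vote of the sampled version space (equal weights)."""
    return sign(sum(sign(dot(J, xi)) for J in members))

rng = random.Random(0)
N, P, M = 8, 5, 101  # M odd, so the vote cannot be tied
teacher = random_vector(N, rng)
data = []
for _ in range(P):
    xi = random_vector(N, rng)
    data.append((xi, sign(dot(teacher, xi))))

members = sample_version_space(data, N, M, rng)
J_cp = centre_point(members, N)
tests = [random_vector(N, rng) for _ in range(500)]
agreement = sum(sign(dot(J_cp, xi)) == bayes_vote(members, xi)
                for xi in tests) / len(tests)
print("fraction of test patterns with identical answers:", agreement)
```

Since the version space is a convex cone, the centre point itself reproduces all training answers; disagreements with the vote can occur only on new patterns, and for such small $N$ and $M$ some are to be expected.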

We mention at this place that for two-layer perceptrons, in contrast
to the present case, the CP-automaton does not reach the generalization
ability of the Bayes process, except for the parity machine. This
exception is due to the 'chequered' structure of the mapping in the
second layer of the parity machine (each flip of the output of only one
hidden node changes the final classification from (+1) to (-1) and vice
versa). As a consequence, for the parity machine exploring the phase
space around the CP solution by the Bayesian method gives just the same
result as the CP solution itself. The interested reader will find more
details in 0.

In the following we call the CP solution 'CP$_1$-perceptron' if the
uniform prior (1) is used, and 'CP$_2$-perceptron' if only students with
structure $c_s = c_t$ are permitted, prior (2).

5.2 Uniform prior 

The learning curves $\epsilon(\alpha)$ for this prior are shown in
Fig. 9 for several $c_p$ and $c_t = 0$. For comparison the performance
of the Gibbs algorithm is shown as well ($c_p = 0$, Gibbs).

The improvement compared to Gibbs learning is significant and
persists asymptotically, i.e. one obtains for $c_t = 0$ (and probably
also for $c_t \neq 0$) a behaviour again independent of $c_p$, namely
[20]:

$$ \lim_{\alpha\to\infty} \epsilon^{bayes}(\alpha) \simeq \frac{0.44}{\alpha} . \qquad (43) $$

The influence of pattern correlations is similar to that in the Gibbs case.
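For orientation, the gain of the centre-point construction over Gibbs learning can be made concrete in the simplest setting. The sketch below assumes the unstructured case $c_p = c_t = 0$, where the generalization error of a perceptron with normalized teacher overlap $R$ is $\arccos(R)/\pi$ and where $q = r$ holds in the version space, so that the relation $r^{CP} = r/\sqrt{q}$ of section 5.1 reduces to $r^{CP} = \sqrt{r}$. The function names are illustrative only.

```python
import math

def generalization_error(overlap):
    """Error of a perceptron with normalized teacher overlap,
    valid for isotropic (uncorrelated) patterns."""
    return math.acos(overlap) / math.pi

def gibbs_error(r):
    """Typical member of the version space has overlap r."""
    return generalization_error(r)

def centre_point_error(r):
    """Centre point has overlap r / sqrt(q) with q = r (assumed case)."""
    return generalization_error(math.sqrt(r))

for r in (0.2, 0.5, 0.8, 0.95):
    print(f"r = {r:4.2f}   Gibbs: {gibbs_error(r):.3f}   CP: {centre_point_error(r):.3f}")
```

Because $\sqrt{r} > r$ for $0 < r < 1$, the centre point always has the smaller error in this setting.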

Figs. 10a,b present results for the overlap $r(\alpha)$ between teacher
and CP$_1$-perceptron for $c_t = 0$ and $c_t = 0.9$ as a function of
$c_p$. Here one finds similar behaviour as in the preceding section, but
now somewhat more pronounced, namely (i) for $c_t = 0$ the overlap
decreases with increasing $c_p$; (ii) for $c_t \neq 0$ there is a
crossing of the results near $\alpha \approx 2$; and (iii) different
signs of $c_p$ and $c_t$ lead to higher values of $r$ for large
$\alpha$; probably this behaviour generalizes again to multilayer
networks, see Ij^, |[. Fig. 11 deals with the internal structure of the
CP$_1$-perceptron, i.e. the internal overlap $c_s(\alpha)$ of its two
segments is presented, again for $c_t = 0.9$, for various values of
$c_p$. For $\alpha \to \infty$, $c_s(\alpha)$ converges to the internal
structure of the teacher perceptron, i.e. $c_s(\alpha) \to c_t$. The
most prominent difference to the case of Gibbs learning is that here in
the opposite limit $\alpha \to 0$ the CP$_1$-perceptron takes the value
of the spatial correlation of the patterns, i.e.
$c_s(\alpha \to 0) \to c_p$. This has already been observed with the
Hebb rule, see above, and also with maximal-stability learning, [|I^,
in connection with the simpler storage problem.



5.3 Restricted prior 

Now let us look at the result if enhanced prior knowledge is given,
i.e. the internal structure $c_t$ of the teacher. The Bayesian inference
based on this prior has the best possible generalization performance,
since all available prior knowledge is used to minimize the error
probability.

In the averaging process defined by Eq. (36) only those members $J_i$
of the version space are now taken into account which fulfill the
constraint $c_s = c_t$, i.e. which have the same correlation between the
segments as the teacher. In this case the teacher is a typical member
of the restricted version space, so we have $q = r$ and $q_d = c_d$.
Thus, the expression for the free energy simplifies for the
CP$_2$-perceptron with Eqs. (H) and (M) to



F = Extr,,e, { ^ [ln(27r) + ln((r - 1 + q - q) (r - 1 - q + q))] 

+ 1 + ~^ + 4a / DwH{x) Inij'(x)} , (44) 

with $x = (r + c_p c_d)^{1/2}\,(1 + c_p c_t - r - c_p c_d)^{-1/2}$.
Extremizing with respect to $r$ and $c_d$, one gets the quantities
describing $V$ in this case, and from them the behaviour of the
CP$_2$-perceptron.

To check the performance we choose a high teacher correlation,
$c_t = 0.9$. (Clearly, for smaller $c_t$ the expected advantage should
decrease, since the actual restriction of the prior by imposing
$c_s = c_t$ is reduced.) The results in Fig. 12 show the performance of
the CP$_1$ and CP$_2$ perceptrons as a function of $\alpha$ for
$c_t = 0.9$. For intermediate values of $\alpha$ we observe in fact a
substantial improvement of the CP$_2$ results with respect to the
CP$_1$ case.



However, it can be shown, p3[, that again asymptotically for
$\alpha \to \infty$ the results are the same as for the uniform prior
(1). This is a well-known effect in Bayesian learning: for large sizes
of the training set the evidence of the examples dominates the influence
of the prior, which becomes increasingly irrelevant (as long as, in our
case, the correct teacher rule is included with finite probability).
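The way the evidence overrides the prior can be seen in a deliberately small toy problem. The sketch below is not the calculation of this paper: it assumes 36 hypothetical student vectors on a circle (angles in steps of 10 degrees), a teacher at angle 0, three hand-picked training patterns, a flat prior and an "informed" prior that puts ten times more weight on the three students closest to the teacher. With the deterministic noise model the posterior is the prior restricted to the version space.

```python
import math

def unit(angle_deg):
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a))

def sign(x):
    return 1 if x >= 0 else -1

def answer(J, xi):
    return sign(J[0] * xi[0] + J[1] * xi[1])

def bayes_answer(hyps, prior, data, xi):
    """Weighted majority vote; weight = prior inside the version space, 0 outside."""
    weights = [p if all(answer(J, x) == s for x, s in data) else 0.0
               for J, p in zip(hyps, prior)]
    return sign(sum(w * answer(J, xi) for J, w in zip(hyps, weights)))

ks = list(range(-17, 19))                      # 36 candidate students
hyps = [unit(10 * k) for k in ks]
teacher = unit(0)
flat = [1.0] * len(ks)                         # unnormalized uniform prior
informed = [10.0 if abs(k) <= 1 else 1.0 for k in ks]

patterns = [unit(a) for a in (33, -84, 84)]    # hand-picked training questions
data = [(x, answer(teacher, x)) for x in patterns]
test_pattern = unit(75)

for p in (2, 3):
    print(p,
          bayes_answer(hyps, flat, data[:p], test_pattern),
          bayes_answer(hyps, informed, data[:p], test_pattern),
          answer(teacher, test_pattern))
```

With two examples the version space still contains six students and only the informed prior gives the teacher's answer on the test pattern; the third example reduces the version space to the teacher alone, after which both priors give the same answer.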
6 Conclusions



We have studied the generalization properties of student perceptrons,
which try to learn a "classification rule with spatial correlations", im- 
plemented by a teacher perceptron with built-in spatial correlations 
between the components of the coupling vector. 'Batch learning' is 
used, and the patterns are drawn from a spatially nonuniform distri- 
bution as well, allowing correlations between different sites, which can 
be different, however, from the above-mentioned spatial correlations 
of the teacher. We concentrated on the natural case of "segmented 
perceptrons" and "segmented patterns", where the correlations were 
those of corresponding sites in different segments, and where the dif- 
ferent correlation matrices involved in our formalism had at least the 
same eigenvectors ('quasi-segmented systems'). 

Using the replica method [|T6|, 0] with a replica symmetric ansatz,
which is exact in this case, we obtained the behaviour of Gibbs and
Bayesian learning in the thermodynamic limit. As a third learning
algorithm we investigated the Hebb rule, and found that in the presence
of correlations it is useful only for low loading and for exceptional
limiting cases of vanishing or extreme correlation: otherwise there
remains a residual error for $\alpha \to \infty$. However, due to its
simplicity, the Hebb rule allows the easiest determination of the
site-correlation measure $c_t$ of the "teacher rule" by monitoring the
pattern correlation $c_p$ and the generalization error for finite
$\alpha$ and comparing with Eqn. ([T7|).

In contrast, for the Gibbs and Bayes cases we find that the
structure of the patterns and of the teacher machines does not matter
asymptotically for $\alpha \to \infty$, and perfect generalization is
achieved. Nevertheless, in an intermediate $\alpha$-regime the
performance is quite sensitive to correlations, which can improve or
worsen the generalization ability.

(We only mention at this place that we have verified some results by 
numerical implementation of the learning algorithms, which is difficult 
for Gibbs and Bayes processes: We simply used small systems, where 
the phase space was sampled by Monte Carlo methods; a more effective 
way allowing for larger systems is suggested in a recent preprint of Berg 
and Engel, 0.) 

Difficult learning cases are those with opposite correlations in the
patterns and the teacher vector, respectively. For the Hebb rule the
residual error is high; for the other learning rules the generalization
error is high for intermediate $\alpha$.

These effects can be understood better by viewing the scenario as 
a learning problem with different magnitudes for different components 
of patterns and teacher vectors. This view relates the problem to
methods like principal component analysis. Here an interesting and
practical extension would be to investigate the influence of noise, whose 
disturbing influence should depend on the relation between its size 
and the corresponding magnitudes of pattern and teacher-vector sites, 
see for multilayer networks with noise, but still for uncorrelated 
patterns. 

For the Bayesian case we investigated the influence of different
priors, showing that improved prior knowledge (e.g. based on a knowledge
of the just mentioned quantity $c_t$) enhances the performance, but
again only for an intermediate regime of $\alpha$. This corresponds to
the well-known fact that prior information loses significance for large
training sets.

The case of Maximum-Stability learning, where the AdaTron algorithm
of Anlauf and Biehl provides a fast and effective learning algorithm,
p5[, and a related cavity method, will be the topics of a subsequent
paper.
Figure Captions

Fig. 1: For Hebbian learning with a correlation parameter $c_t = 0.7$
of the two segments of the teacher perceptron, the generalization error
$\epsilon(\alpha)$ is presented as a function of the reduced size
$\alpha := p/N$ of the training set for different values of the pattern
correlation parameter $c_p$.

Fig. 2: The limit of the generalization error for $\alpha \to \infty$
in the case of Hebbian learning is presented as a function of the
pattern correlation parameter $c_p$ for different values of the
correlation $c_t$ of the two segments of the teacher perceptron.

Fig. 3: For the case of Gibbs learning, the generalization error
$\epsilon(\alpha)$ is presented as a function of the reduced size
$\alpha := p/N$ of the training set, for $c_t = 0$ and different values
of $|c_p|$.

Fig. 4: For the case of Gibbs learning and $\alpha = 2$, the
generalization error $\epsilon(c_p)$ is presented as a function of the
pattern correlation parameter $c_p$ for different values of $c_t$.

Fig. 5a,b: For Gibbs learning with $c_t = 0$ and $c_t = 0.9$,
respectively, the normalized overlap $r(\alpha)$ of the coupling vectors
of the teacher and the student perceptron is presented as a function of
the reduced size $\alpha := p/N$ of the training set.

Fig. 6: For Gibbs learning with $c_t = 0.9$, the order parameters
$q(\alpha)$, which is the typical overlap between the coupling vectors
of two different student perceptrons, and $r(\alpha)$, which is the
overlap between the coupling vectors of a typical student and the
teacher, are presented as a function of the reduced size
$\alpha := p/N$ of the training set for the two cases $c_p = \pm 0.9$.

Fig. 7: For Gibbs learning, the evolution of the correlation parameter
$c_s(\alpha)$ between the two segments of the student perceptron, as it
develops as a function of the reduced size $\alpha := p/N$ of the
training set, is presented over $\alpha$ for $c_t = 0.9$ and
$c_p = 0$, $\pm 0.7$ and $\pm 0.9$.

Fig. 8: For Gibbs learning, the evolution of the cross-correlation
parameter of two different segments of the teacher and the student
perceptron, see Eqn. (13), as it develops as a function of the reduced
size $\alpha := p/N$ of the training set, is presented over $\alpha$ for
$c_t = 0.9$ and $c_p = 0.2$, $0.7$ and $0.9$.

Fig. 9: For Bayesian learning with uniform prior, i.e. the CP$_1$
perceptron, and $c_t = 0$, the generalization error $\epsilon(\alpha)$
is presented over the reduced size $\alpha := p/N$ of the training set,
for $c_p = 0$, $0.7$ and $0.9$, and for comparison also for Gibbs
learning with $c_p = 0$.

Fig. 10a,b: For Bayesian learning with uniform prior, i.e. the CP$_1$
perceptron, for the two cases $c_t = 0$ and $c_t = 0.9$, the overlap
$r(\alpha)$ between the coupling vector of the teacher and the CP$_1$
student perceptron is presented as a function of the reduced size
$\alpha := p/N$ of the training set, for pattern correlations
$c_p = 0$, $\pm 0.7$ and $\pm 0.9$.

Fig. 11: For Bayesian learning with uniform prior, i.e. the CP$_1$
perceptron, for $c_t = 0.9$, the evolution of the correlation
$c_s(\alpha)$ between the two different segments of the CP$_1$ student
perceptron is presented, as it evolves as a function of the reduced size
$\alpha := p/N$ of the training set, for pattern correlations
$c_p = 0$, $0.2$, $0.7$ and $0.9$.

Fig. 12: For Bayesian learning with restricted prior, i.e. the CP$_2$
perceptron, for $c_t = 0.9$, the generalization error $\epsilon(\alpha)$
is presented as a function of the reduced size $\alpha := p/N$ of the
training set, for pattern correlations $c_p = 0$, $0.7$, $0.9$, and
also, for comparison, with unrestricted prior (i.e. the CP$_1$
perceptron) and $c_p = 0$.
[Figure: generalization error $\epsilon(\alpha)$ for $c_t = 0.7$]
[Figure: $q(\alpha)$ and $r(\alpha)$ for $c_t = 0.9$]